Abstract

We propose a new kernel for biological sequences which borrows ideas and techniques from 
information theory and data compression. This kernel can be used in combination with any 
kernel method, in particular Support Vector Machines for protein classification. By incorporating
prior biological assumptions on the properties of amino-acid sequences and using a Bayesian
averaging framework, we compute the value of this kernel in linear time and space, benefiting
from previous achievements proposed in the field of universal coding. Encouraging classification
results are reported on a standard protein homology detection experiment. 

1 Introduction 

The need for efficient analysis and classification tools for protein sequences is more than ever a core 
problem in computational biology. In particular, the availability of an ever-increasing quantity of 
biological sequences calls for efficient and computationally feasible algorithms to detect functional 
similarities between DNA or amino-acid sequences, cluster them, and annotate them. 

Recent years have witnessed the rapid development of a class of algorithms called kernel methods 
|2(J| that may offer useful tools for these tasks. In particular, the Support Vector Machine (SVM) 
algorithms provide state-of-the-art performance in many real-world problems of classifying 
objects into predefined classes. SVMs have already been applied with success to a number of issues 
in computational biology, including but not limited to protein homology detection ^JJ El El El 
|5J E] functional classification of genes |171 \'2b\ , or prediction of gene localization . A more 
complete survey of the application of kernel methods in computational biology is presented in the 
forthcoming book [21].

The basic ingredient shared by all kernel methods is the kernel function, which measures
similarities between pairs of objects to be analyzed or classified, protein sequences in our case. While
early work on SVMs focused on the classification of vector-valued objects, for which kernels are well
understood, recent attempts to use SVMs for the classification of more general objects have resulted
in the development of several kernels for strings [271 El CC3 El El El HI E] > graphs , or even 
phylogenetic profiles [25].

A useful kernel for protein sequences should have several properties. It should be rapid to
compute (typically, have a linear complexity with respect to the lengths of the compared sequences),
represent a biologically relevant measure of similarity, and be general enough to be applied without
tuning on different datasets, yet efficient in terms of classification accuracy. Such an ideal kernel
probably does not exist, and different kernels might be useful in different situations. For large-scale
or on-line applications, the computation cost becomes critical and only fast kernels, such as the
spectrum [15] and mismatch |TH] kernels, can be accepted. In applications where accuracy is more
important than speed, slower kernels that include more biological knowledge, such as the Fisher
[13] or local alignment [2f)] kernels, might be accepted if they improve the performance of a classifier.

Our contribution in this paper is to introduce a new class of kernels for strings that are both
rapid to compute (they have linear complexity in time and memory) and informed by
biological knowledge. The biological knowledge takes the form of a family of probabilistic models
for sequences that are expected to model general classes of proteins well. The ones we consider are
variable-length Markov chains, also known as context-tree models [28] or probabilistic suffix trees
PQ. These models offer three advantages: first, they have been shown to be useful to represent
protein families pJEJ; second, they can have different degrees of generality by varying the suffix
tree, which allows them to model larger or smaller classes of sequences; and third, their structure enables
us to derive a kernel that can be implemented in linear time and space with respect to the sequence
length. The last two features would not be shared by more complex models such as hidden Markov
models [7]. A second source of biological information is represented by a prior distribution on
the models, including the use of Dirichlet mixtures [Zj to take into account similarities between
amino acids.

As opposed to the classical use of probabilistic models to model families of sequences ^ |Hj or
to the Fisher kernel, we do not perform any parameter or model estimation. Instead, we project each
sequence to be compared onto the set of all distributions in the probabilistic models, and compare
different sequences through their respective projections. The resulting kernel belongs to the class
of covariance kernels introduced in [22]. Formally, the computation of the kernel boils down to
computing some posterior distribution for pairs of sequences in a Bayesian framework. The
computation can be performed efficiently thanks to a factorization of the family of context-tree
models using a trick presented in [28]. The resulting kernel can be interpreted in the light of
noiseless coding theory [B]: it is related to the gain in redundancy when the two sequences compared
are compressed together rather than independently of one another.

The paper is organized as follows. In Section 2 we present the general strategy of making
covariance kernels from families of probabilistic models. In Section 3 we define a kernel for protein
sequences based on context-tree models. Its efficient implementation is presented in Section 4 and
some of its properties are discussed in Section 5. Experimental results on a benchmark problem of
remote homology detection are presented in Section 6.

2 Probabilistic models and covariance kernels 

A (parametric) probabilistic model on a measurable space $\mathcal{X}$ is a family of distributions $\{P_\theta,\ \theta \in \Theta\}$
on $\mathcal{X}$, where $\theta$ is the parameter of the distribution $P_\theta$. Typically, the set of parameters $\Theta$ is a subset
of $\mathbb{R}^n$, in which case $n$ is called the dimension of the model. As an example, a hidden Markov model
(HMM) for sequences is a parametric model, the parameters being the transition and emission
probabilities. A family of probabilistic models is a family $\{P_{f,\theta_f},\ f \in \mathcal{F},\ \theta_f \in \Theta_f\}$, where $\mathcal{F}$ is a
finite or countable set, and $\Theta_f \subset \mathbb{R}^{\dim(f)}$ for each $f \in \mathcal{F}$, where $\dim(f)$ denotes the dimension of
$f$. An example of such a family would be a set of HMMs with different architectures and numbers
of states. Probabilistic models are typically used to model sets of elements $X_1, \ldots, X_n \in \mathcal{X}$, by
selecting a model $f$ and choosing a parameter $\theta_f$ that best "fits" the dataset, using criteria such
as penalized maximum likelihood or maximum a posteriori probability.

Alternatively, probabilistic models can also be used to characterize each single element $X \in \mathcal{X}$
by the representation $\phi(X) = \left(P_{f,\theta_f}(X)\right)_{f \in \mathcal{F},\ \theta_f \in \Theta_f}$. If the probabilistic models are designed in
such a way that each distribution is roughly characteristic of a class of objects of interest, then the
representation $\phi(X)$ quantifies how well $X$ fits each class. In this representation, each distribution can
be seen as a filter that extracts one piece of information from $X$, namely the probability of $X$ under this
distribution, or equivalently how well $X$ fits the class modelled by this distribution.

Kernels are real-valued functions $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that can be written in the form of a dot
product $K(X, Y) = \langle \psi(X), \psi(Y) \rangle$ for some mapping $\psi$ from $\mathcal{X}$ to a Hilbert space. Given the
preceding mapping $\phi$, a natural way to derive a kernel from a family of probabilistic models is to
endow the set of representations $\phi(X)$ with a dot product, and set $K(X,Y) = \langle \phi(X), \phi(Y) \rangle$. This
can be done for example if a prior probability $\pi(f, d\theta_f)$ can be defined on the set of distributions
in the models, by considering the following dot product:

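$$K(X,Y) = \langle \phi(X), \phi(Y) \rangle = \sum_{f \in \mathcal{F}} \pi(f) \int_{\Theta_f} P_{f,\theta_f}(X)\, P_{f,\theta_f}(Y)\, \pi(d\theta_f \mid f). \qquad (1)$$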

By construction, the kernel (1) is a valid kernel that belongs to the class of covariance kernels [23].
Observe that, contrary to the Fisher kernel, which also uses probabilistic models to define a kernel, no
model or parameter estimation is required in (1). Intuitively, for any two elements $X$ and $Y$ the
kernel (1) automatically detects the models and parameters that explain both $X$ and $Y$ well.

There is of course some arbitrariness in this kernel, both in the definition of the models and in the
choice of the prior distribution $\pi$. This freedom can be used to include prior (biological) knowledge
in the kernel. For example, if one wants to detect similarity with respect to families of sequences
known to be adequately modelled by HMMs, then using HMM models constrains the kernel to
detect such similarities. We use this idea below to define a set of models and prior distributions
for protein sequences.

As the probability of a sequence under the models we define below decreases roughly exponentially
with its length, the value of the kernel (1) can be strongly biased by differences in length
between the sequences, and can take exponentially small values. This is a classical issue with many
string kernels that leads to poor classification performance with SVMs [22, 26]. This undesirable
effect can easily be controlled in our case by normalizing the likelihoods as follows:

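$$K_\sigma(X,Y) = \sum_{f \in \mathcal{F}} \pi(f) \int_{\Theta_f} P_{f,\theta_f}(X)^{\frac{\sigma}{|X|}}\, P_{f,\theta_f}(Y)^{\frac{\sigma}{|Y|}}\, \pi(d\theta_f \mid f), \qquad (2)$$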

where $\sigma$ is a width parameter and $|X|$ and $|Y|$ stand for the lengths of both sequences. Equation
(2) clearly defines a valid kernel (only the feature extractor $\phi$ is modified), and the parameter $\sigma$ controls
the range of values it takes.
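
To make the length-normalized covariance kernel concrete, here is a minimal sketch under simplifying assumptions that are not part of the paper: the model family is reduced to i.i.d. Bernoulli models on binary sequences, and the integral over parameters is replaced by a finite grid `thetas` with prior weights `prior`. The function names are hypothetical.

```python
import math


def log_lik(seq, theta):
    """Log-likelihood of a 0/1 sequence under an i.i.d. Bernoulli(theta) model."""
    ones = sum(seq)
    return ones * math.log(theta) + (len(seq) - ones) * math.log(1.0 - theta)


def covariance_kernel(x, y, thetas, prior, sigma=1.0):
    """Prior-weighted sum of P(x)^(sigma/|x|) * P(y)^(sigma/|y|) over the grid."""
    total = 0.0
    for theta, weight in zip(thetas, prior):
        px = math.exp(sigma * log_lik(x, theta) / len(x))
        py = math.exp(sigma * log_lik(y, theta) / len(y))
        total += weight * px * py
    return total


# Illustrative grid and uniform prior (assumed values).
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * 5
x = [1, 1, 1, 0, 1, 1]                      # 5/6 ones
y = [1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]    # same composition, twice as long
z = [0, 0, 0, 1, 0, 0]                      # 1/6 ones
print(covariance_kernel(x, y, thetas, prior), covariance_kernel(x, z, thetas, prior))
```

Because of the exponents, `x` and `y` receive the same normalized likelihoods despite their different lengths, so the kernel compares composition rather than length.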

3 A covariance kernel based on context-tree models 

In this section we derive explicitly a covariance kernel for strings based on context-tree models with
mixtures of Dirichlet priors. Context-tree models are Markovian models which define an efficient
framework to describe constraints on amino-acid successions in proteins, as validated by their
use in P3IH|. Dirichlet priors offer a biologically meaningful estimation of the likelihood of such
transitions by providing a priori knowledge on the multinomial parameters which parameterize
the transitions of Markovian models.

3.1 Framework and notations 

Starting with basic notations and definitions, let $E$ be a finite set of size $d$ called the alphabet. Practically
speaking, $E$ can be thought of as the 20-letter alphabet of amino acids. For a given depth $D \in \mathbb{N}$
corresponding to the maximal memory of our Markovian models, we write $\mathcal{X} = \bigcup_{n=0}^{\infty} (E^D \times E)^n$ for the
set on which we define our kernel. Observe that we do not define the kernel directly on the set
of finite-length sequences, but rather in a slightly more general framework more amenable to the
generalizations discussed in Section 5.1 below. $M$ is the set of strings over $E$ of length at most $D$,
i.e. $M = \bigcup_{i=0}^{D} E^i$ (note that $\mathrm{card}(M) = \frac{d^{D+1} - 1}{d - 1}$), and $\emptyset$ is the empty word. We thus represent
sequences as finite sets of (context, letter) pairs, where the context is a substring of length $D$ of
the initial sequence and the letter is the element that follows it. This transformation is justified by the
fact that we consider Markovian models below. An element $X \in \mathcal{X}$ can therefore be written as
$X = \{(x^i, x'^i)\}_{i=1..N_X}$ where $N_X$ is the cardinality of $X$ (which we also write $|X|$) and for all $i$,
$x^i \in E^D$ and $x'^i \in E$.

Figure 1: Tree representation of a context-tree distribution

3.2 Context-tree models 

Context-tree distributions require the definition of a complete suffix dictionary (c.s.d.) $\mathcal{D}$. A c.s.d.
is a finite set of words of $M \setminus \{\emptyset\}$ such that any left-infinite sequence has a suffix in $\mathcal{D}$, but no word
in $\mathcal{D}$ has a suffix in $\mathcal{D}$. We write $L(\mathcal{D})$ for the length of the longest word contained in $\mathcal{D}$ and $\mathcal{F}_D$ for the
set of c.s.d. $\mathcal{D}$ that satisfy $L(\mathcal{D}) \le D$. Once this tree structure is set, we can define a distribution
on $\mathcal{X}$ by attaching one multinomial distribution $\theta_s \in \Sigma_d$ (see footnote 1) to each word $s$ of a c.s.d. $\mathcal{D}$. Indeed, by
denoting $\theta = (\theta_s)_{s \in \mathcal{D}}$ we define a conditional distribution on $\mathcal{X}$ by the following equation:

$$P_{\mathcal{D},\theta}(X) = \prod_{i=1}^{N_X} \theta_{\mathcal{D}(x^i)}(x'^i), \qquad (3)$$

where for any word $m$ in $E^D$, $\mathcal{D}(m)$ is the unique suffix of $m$ in $\mathcal{D}$.

We present in Figure 1 an example where $E = \{A, B, C\}$, the maximal depth $D$ is set to 3
and where $\mathcal{D} = \{A, AB, BB, ACB, BCB, CCB, C\}$, with corresponding $\theta_s$ parameters for $s \in \mathcal{D}$,
each $\theta_s$ being a vector of the three-dimensional simplex $\Sigma_3$. We will also write $\mathcal{P}_D = \{(\mathcal{D}, \theta) : \mathcal{D} \in
\mathcal{F}_D, \theta \in \Theta_{\mathcal{D}}\}$ for the set of context-tree distributions of depth $D$.
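
The likelihood of equation (3) can be sketched in a few lines for the toy dictionary above. The multinomial values stored in `theta` below are made up for illustration, and the helper names are hypothetical; only the dictionary itself comes from the text.

```python
import math


def matching_suffix(context, dictionary):
    """Return the unique suffix of `context` that belongs to the suffix dictionary."""
    for k in range(1, len(context) + 1):
        if context[-k:] in dictionary:
            return context[-k:]
    raise ValueError("dictionary is not complete for context %r" % context)


def log_likelihood(pairs, theta):
    """Sum of log theta_{D(x^i)}(x'^i) over the (context, letter) pairs."""
    return sum(math.log(theta[matching_suffix(c, theta)][e]) for c, e in pairs)


# Toy complete suffix dictionary from the text (alphabet {A, B, C}, depth 3).
words = ["A", "AB", "BB", "ACB", "BCB", "CCB", "C"]
# Hypothetical multinomial parameters, one distribution per word.
theta = {w: {"A": 0.5, "B": 0.25, "C": 0.25} for w in words}
theta["AB"] = {"A": 0.1, "B": 0.1, "C": 0.8}

# Pairs extracted from the sequence CABCA with contexts of length 3.
pairs = [("CAB", "C"), ("ABC", "A")]
print(math.exp(log_likelihood(pairs, theta)))
```

The first pair is scored by the distribution attached to `AB` and the second by the one attached to `C`, which is how a variable-length memory is obtained from fixed-length contexts.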

1 $\Sigma_d$ is the canonical simplex of dimension $d$, i.e. $\Sigma_d = \{\xi = (\xi_i)_{1 \le i \le d} : \xi_i \ge 0,\ \sum_{i} \xi_i = 1\}$.

3.3 Prior distributions on context-tree models



Having defined a family of distributions $\mathcal{P}_D$ and recalling (1), we define in this section a prior
probability $\pi(\mathcal{D}, d\theta)$ on $\mathcal{P}_D$. This probability factorizes as $\pi(\mathcal{D}, d\theta) = \pi(\mathcal{D})\,\pi(d\theta \mid \mathcal{D})$, two terms
which are defined as follows.



3.3.1 Prior on the tree structure 

$\mathcal{F}_D$ is the set of complete trees of depth at most $D$. Intuitively it would make sense to put
more prior weight on small trees than on large trees. Indeed, the number of different trees with
a given number of leaves increases roughly exponentially with the number of leaves. As a result,
small trees would have a very low influence compared to big trees if their prior probability was not
boosted. Following [28] we define a simple probability $\pi$ on $\mathcal{F}_D$ that has this property by describing
a random generation of trees. Starting from the root, the tree generation process recursively follows
this rule: each node has $d$ children with probability $\varepsilon$, and 0 children with probability $1 - \varepsilon$
(it is then a leaf). In mathematical terms, this defines a branching process. If we denote by $\mathring{\mathcal{D}}$ the
set of strict suffixes of elements of $\mathcal{D}$, the probability of a tree is given by:

$$\pi(\mathcal{D}) = \prod_{s \in \mathring{\mathcal{D}}} \varepsilon \prod_{s \in \mathcal{D},\ l(s) < D} (1 - \varepsilon) = \varepsilon^{\mathrm{card}(\mathring{\mathcal{D}})}\, (1 - \varepsilon)^{\mathrm{card}\{s \in \mathcal{D} \,\mid\, l(s) < D\}}. \qquad (4)$$
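
As a small check of equation (4), the sketch below computes the prior of a dictionary by counting its internal nodes (the strict suffixes of its words, including the empty word for the root) and its leaves of length below the depth. The function name `tree_prior` is hypothetical.

```python
def tree_prior(dictionary, depth, eps):
    """Branching-process probability of a complete suffix dictionary."""
    # Internal nodes: every strict suffix of a word, down to the empty word.
    internal = {w[k:] for w in dictionary for k in range(1, len(w) + 1)}
    # Leaves shorter than the maximal depth chose not to branch.
    shallow_leaves = [w for w in dictionary if len(w) < depth]
    return eps ** len(internal) * (1.0 - eps) ** len(shallow_leaves)


# Toy dictionary from Section 3.2: internal nodes are '', 'B' and 'CB'.
words = ["A", "AB", "BB", "ACB", "BCB", "CCB", "C"]
print(tree_prior(words, depth=3, eps=0.5))
```

For the toy dictionary there are three internal nodes and four leaves of length below 3 (A, C, AB, BB); the three leaves of length 3 contribute no factor because nodes at maximal depth cannot branch.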



3.3.2 Priors on multinomial parameters 

For a given tree $\mathcal{D}$ we now define a prior on $\Theta_{\mathcal{D}} = (\Sigma_d)^{\mathcal{D}}$. We assume an independent prior among
multinomials attached to different words with the following form:

$$\pi(d\theta \mid \mathcal{D}) = \prod_{s \in \mathcal{D}} \omega(d\theta_s).$$

Here $\omega$ is a prior distribution on the simplex $\Sigma_d$. Following [28] a simple choice is to take a Dirichlet
prior of the form:

$$\omega_\beta(d\theta) \propto \prod_{i=1}^{d} \theta_i^{\beta_i - 1}\, \lambda(d\theta),$$

where $\lambda$ is Lebesgue's measure and $\beta = (\beta_i)_{i=1..d}$ is the parameter of the Dirichlet distribution.
As it has been observed that mixtures of Dirichlet distributions are a more natural way to model distributions
on amino acids |H we propose to use such a prior here. An additive mixture of $n$ Dirichlet
distributions is defined by $n$ Dirichlet parameters $\beta^{(1)}, \ldots, \beta^{(n)}$ and by the probabilities $\gamma^{(1)}, \ldots, \gamma^{(n)}$ of
each mixture component (with $\sum_{k=1}^{n} \gamma^{(k)} = 1$), and has the following definition:

$$\omega(d\theta_s) = \sum_{k=1}^{n} \gamma^{(k)}\, \omega_{\beta^{(k)}}(d\theta_s). \qquad (5)$$

3.4 Triple mixture covariance kernel 

Combining the definition of the kernel (2) with the definition of the context-tree model distributions
(3) and of the prior on the set of distributions (4, 5), we obtain the following expression for the
kernel:

$$K_\sigma(X,Y) = \sum_{\mathcal{D} \in \mathcal{F}_D} \pi(\mathcal{D}) \int_{\Theta_{\mathcal{D}}} P_{\mathcal{D},\theta}(X)^{\frac{\sigma}{N_X}}\, P_{\mathcal{D},\theta}(Y)^{\frac{\sigma}{N_Y}} \prod_{s \in \mathcal{D}} \left( \sum_{k=1}^{n} \gamma^{(k)}\, \omega_{\beta^{(k)}}(d\theta_s) \right). \qquad (6)$$

We observe that (6) involves three summations, respectively over the trees, the components of the
Dirichlet mixtures, and the multinomial parameters. This generalizes the double mixture performed
in [28] in the context of sequence compression by adding a mixture of Dirichlet priors, justified by our
goal to process protein sequences.



4 Kernel implementation 

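The definition of the kernel in (6) does not express a practical way to compute it. To do so,
we propose to adapt the context-tree weighting algorithm, first introduced in [28], which is based on a
factorization of the kernel along the branches of the context tree. Let us first introduce a few more
notations. Given $r \in \mathbb{N}$, $\beta = (\beta_i)_{1 \le i \le r} \in (\mathbb{R}^{+*})^r$ and $\alpha = (\alpha_i)_{1 \le i \le r} \in (\mathbb{R}^{+})^r$, we set:

$$Q_\theta(\alpha) = \prod_{i=1}^{r} \theta_i^{\alpha_i}, \qquad G_\beta(\alpha) = \frac{\Gamma(\beta_\cdot)}{\Gamma(\alpha_\cdot + \beta_\cdot)} \prod_{i=1}^{r} \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\beta_i)},$$

where $\Gamma$ is the Gamma function [7j, $\beta_\cdot = \sum_{i=1}^{r} \beta_i$ and $\alpha_\cdot = \sum_{i=1}^{r} \alpha_i$. The quantity $G_\beta(\alpha)$
corresponds to the averaging of likelihoods $Q_\theta(\alpha)$ under a Dirichlet prior of parameter $\beta$ for $\theta$. We can
now divide the computation into two phases, which can be implemented alongside each other.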
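
The Dirichlet average $G_\beta(\alpha)$ is the standard Dirichlet-multinomial integral, which has a closed form in terms of Gamma functions. A minimal sketch, working in log space to avoid overflow (the function name `log_G` is hypothetical):

```python
from math import exp, lgamma


def log_G(beta, alpha):
    """Log of the average of prod_i theta_i**alpha_i under a Dirichlet(beta) prior."""
    b, a = sum(beta), sum(alpha)
    return lgamma(b) - lgamma(a + b) + sum(
        lgamma(ai + bi) - lgamma(bi) for ai, bi in zip(alpha, beta))


# Uniform prior on the 2-simplex: the mean of theta_1 * theta_2 is 1/6.
print(exp(log_G([1.0, 1.0], [1.0, 1.0])))
```

The exponents `alpha` need not be integers, which matters here because the counters used by the kernel are rescaled by the width parameter and the sequence lengths.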

4.1 Defining counters 

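The first step of the algorithm is to compute, for $e \in E$ and $m \in M$, the following counters:

$$\rho_m(X) = \sum_{i=1}^{N_X} \mathbf{1}(x^i = m),$$

$$\hat{\theta}_{m,e}(X) = \begin{cases} \dfrac{\sum_{i=1}^{N_X} \mathbf{1}(x^i = m,\ x'^i = e)}{\rho_m(X)} & \text{if } \rho_m(X) > 0, \\ \dfrac{1}{d} & \text{else,} \end{cases}$$

$$a_{m,e}(X,Y) = \sigma \left( \frac{\rho_m(X)}{N_X}\, \hat{\theta}_{m,e}(X) + \frac{\rho_m(Y)}{N_Y}\, \hat{\theta}_{m,e}(Y) \right).$$

The counter $\rho_m(X)$ keeps track of the frequency of the motif $m$ in the set $X$, while $\hat{\theta}_{m,e}(X)$ summarizes
the empirical probability of observing the letter $e$ after $m$ has been observed. Finally, $a_{m,e}(X,Y)$
is a weighted average of the transitions encountered in both $X$ and $Y$. The most
efficient way to compute these counters is to first define them for $m$ ranging over the visited
contexts, of which there are at most $N_X + N_Y$, and then to use the following downward recursion on
the length of the string $m$ when $m$ ranges over all suffixes of visited contexts:

$$\rho_m(X) = \sum_{f \in E} \rho_{f.m}(X),$$

$$\hat{\theta}_{m,e}(X) = \frac{\sum_{f \in E} \rho_{f.m}(X)\, \hat{\theta}_{f.m,e}(X)}{\rho_m(X)},$$

$$a_{m,e}(X,Y) = \sum_{f \in E} a_{f.m,e}(X,Y).$$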
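
The counters can be sketched by direct counting over all suffixes of each visited context, which gives the same values as the downward recursion. The definition of the joint counter used here, `sigma * (count_X / N_X + count_Y / N_Y)`, is an interpretation of the "weighted average of transitions" that is consistent with the proof in Section 4.2; the function names are hypothetical.

```python
from collections import defaultdict


def transition_counts(pairs):
    """counts[m][e] = number of pairs whose context has suffix m and whose letter is e."""
    counts = defaultdict(lambda: defaultdict(int))
    for context, letter in pairs:
        for k in range(len(context) + 1):   # all suffixes, down to the empty word
            counts[context[k:]][letter] += 1
    return counts


def joint_counters(x_pairs, y_pairs, sigma=1.0):
    """Joint counters a[m][e], for every suffix m of a visited context."""
    cx, cy = transition_counts(x_pairs), transition_counts(y_pairs)
    a = {}
    for m in set(cx) | set(cy):
        letters = set(cx[m]) | set(cy[m])
        a[m] = {e: sigma * (cx[m][e] / len(x_pairs) + cy[m][e] / len(y_pairs))
                for e in letters}
    return a


x = [("AB", "C"), ("BC", "A")]   # toy representation of the sequence ABCA, D = 2
print(joint_counters(x, x))
```

For a fixed depth the loop touches a constant number of suffixes per pair, so the cost stays linear in the total number of pairs.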

4.2 Recursive computation of the triple mixture

We can now attach to each $m$ for which we have calculated the previous counters the value:

$$K_m(X,Y) = \sum_{k=1}^{n} \gamma^{(k)}\, G_{\beta^{(k)}}\left( (a_{m,e}(X,Y))_{e \in E} \right),$$

which computes two mixtures, the first being continuous over the possible values of $\theta$ weighted by a
Dirichlet prior and the second being discrete, using the different weighted Dirichlet distributions
given by the mixture $(\gamma^{(k)}, \beta^{(k)})$. Here we assume that the possible values for $G_{\beta^{(k)}}$ are computed
beforehand and stored in a table. By now defining the quantity $\Upsilon_m(X,Y)$, which is also attached
to each visited word $m$ and computed recursively:

$$\Upsilon_m(X,Y) = \begin{cases} K_m(X,Y) & \text{if } l(m) = D, \\ (1 - \varepsilon)\, K_m(X,Y) + \varepsilon \prod_{e \in E} \Upsilon_{e.m}(X,Y) & \text{if } l(m) < D, \end{cases}$$

we compute the third mixture over the different possible tree structures of our complete suffix
dictionary by taking into account the branching probability $\varepsilon$. Indeed, we finally have, recalling that
$\emptyset$ is the empty word:

$$K_\sigma(X,Y) = \Upsilon_\emptyset(X,Y). \qquad (7)$$

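Proof. In order to prove (7), let us first fix a tree $\mathcal{D}$ and observe that, for $X = (x^i, x'^i)_{1 \le i \le N_X}$ and
$Y = (y^i, y'^i)_{1 \le i \le N_Y}$:

$$\begin{aligned}
\int_{\Theta_{\mathcal{D}}} P_{\mathcal{D},\theta}(X)^{\frac{\sigma}{N_X}}\, P_{\mathcal{D},\theta}(Y)^{\frac{\sigma}{N_Y}} & \prod_{s \in \mathcal{D}} \left( \sum_{k=1}^{n} \gamma^{(k)}\, \omega_{\beta^{(k)}}(d\theta_s) \right) \\
&= \int_{\Theta_{\mathcal{D}}} \prod_{s \in \mathcal{D}} \left( \prod_{e \in E} \theta_s(e)^{a_{s,e}(X,Y)} \right) \left( \sum_{k=1}^{n} \gamma^{(k)}\, \omega_{\beta^{(k)}}(d\theta_s) \right) \\
&= \prod_{s \in \mathcal{D}} \sum_{k=1}^{n} \gamma^{(k)} \int_{\Sigma_d} \left( \prod_{e \in E} \theta_s(e)^{a_{s,e}(X,Y)} \right) \omega_{\beta^{(k)}}(d\theta_s) \\
&= \prod_{s \in \mathcal{D}} \sum_{k=1}^{n} \gamma^{(k)}\, G_{\beta^{(k)}}\left( (a_{s,e}(X,Y))_{e \in E} \right) = \prod_{s \in \mathcal{D}} K_s(X,Y),
\end{aligned}$$

where we have used Fubini's theorem to factorize the integral in the second line. Having in mind
(6), we have thus proved that $K_\sigma(X,Y) = \sum_{\mathcal{D} \in \mathcal{F}_D} \pi(\mathcal{D}) \prod_{s \in \mathcal{D}} K_s(X,Y)$. The second part of the
proof is identical to the one given in [28], to which we refer to finalize this result. □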

The computation of the counters has a linear cost in time and memory with respect to $N_X + N_Y$.
As only nodes that correspond to suffixes of contexts visited in $X$ and $Y$ are created, the recursive computation of $\Upsilon_m$ is
also linear (the values $\Upsilon_m$ on non-existing nodes being equal to 1). As a result, the computation
of the kernel is linear in time and space with respect to $N_X + N_Y$.
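
The two phases of this section can be put together in a compact sketch. It is not the authors' implementation: it assumes that every context has length exactly `depth`, that the joint counters are `sigma * (count_X / N_X + count_Y / N_Y)`, that the node value is the mixture of Dirichlet averages, and that the mixture is passed as a list of `(gamma, beta)` pairs with `beta` ordered like `alphabet`. All names are hypothetical.

```python
from collections import defaultdict
from math import exp, lgamma


def log_G(beta, alpha):
    """Log of the Dirichlet(beta) average of prod_i theta_i**alpha_i."""
    b, a = sum(beta), sum(alpha)
    return lgamma(b) - lgamma(a + b) + sum(
        lgamma(ai + bi) - lgamma(bi) for ai, bi in zip(alpha, beta))


def context_tree_kernel(x_pairs, y_pairs, alphabet, depth, sigma, eps, mixture):
    # Phase 1: joint counters a[m][e] for every suffix m of a visited context.
    a = defaultdict(lambda: defaultdict(float))
    for pairs in (x_pairs, y_pairs):
        weight = sigma / len(pairs)
        for context, letter in pairs:
            for k in range(depth + 1):
                a[context[k:]][letter] += weight

    def k_node(m):
        # Mixture over Dirichlet components of the averaged likelihood at node m.
        alpha = [a[m][e] for e in alphabet]
        return sum(g * exp(log_G(beta, alpha)) for g, beta in mixture)

    def upsilon(m):
        # Phase 2: mixture over tree structures, from the leaves up to the root.
        if m not in a:
            return 1.0              # unvisited node: all counters are zero
        if len(m) == depth:
            return k_node(m)
        product = 1.0
        for e in alphabet:
            product *= upsilon(e + m)
        return (1.0 - eps) * k_node(m) + eps * product

    return upsilon("")


x = [("AB", "C"), ("BC", "A")]
mixture = [(1.0, [0.5, 0.5, 0.5])]   # single Dirichlet(1/2, 1/2, 1/2) component
print(context_tree_kernel(x, x, "ABC", 2, 1.0, 0.5, mixture))
```

Only visited nodes are ever expanded, since an unvisited child returns 1 immediately; this is what keeps the recursion proportional to the number of stored counters.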

5 Properties 

Besides a fast implementation, the kernel (6) has several interesting properties that we briefly
mention in this section.

5.1 Sampling, reverse order and mismatches



The kernel (6) is defined for elements of $\mathcal{X}$ which are not strings, but sets of (context, letter) pairs.
We can use this property to allow more flexibility in our kernel by modifying the representation
of a sequence as a set of such pairs. Throughout this section we use a toy example where we
consider $D = 2$, $E = \{A, B, C\}$ and the sequence $ABCA$. The straightforward representation of
this sequence is the set $X = \{(AB, C), (BC, A)\}$ with $N_X = 2$.

When long sequences are compared and one wants to speed up the algorithm,
we first remark that it is possible to represent a sequence by a sub-sample, typically random, of
the original sequence. This corresponds to choosing randomly and uniformly a fixed number
of positions in the sequences, and using the (context, letter) pairs at these positions. By the law of
large numbers, the resulting kernel converges to the true value when the number of points sampled
tends to infinity, and deviation bounds can be evaluated.

Second, as typical motifs might be found in various directions for different sequences, one might
be tempted to use both the string and its reversed form, collecting transitions when reading
the protein in both directions. Following our example, this would yield $X = \{(AB, C), (BC, A), (AC, B), (CB, A)\}$.
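
The representations discussed so far can be sketched as follows. The function names are hypothetical, and the sub-sampling shown here draws positions with replacement, which is only one possible reading of "choosing randomly and uniformly a fixed number of positions".

```python
import random


def to_pairs(seq, depth):
    """Straightforward representation: every (context of length depth, next letter)."""
    return [(seq[i - depth:i], seq[i]) for i in range(depth, len(seq))]


def with_reverse(seq, depth):
    """Pairs collected when reading the sequence in both directions."""
    return to_pairs(seq, depth) + to_pairs(seq[::-1], depth)


def subsample(pairs, n, seed=0):
    """Uniform random sub-sample of n pairs, drawn with replacement."""
    rng = random.Random(seed)
    return [rng.choice(pairs) for _ in range(n)]


print(to_pairs("ABCA", 2))       # [('AB', 'C'), ('BC', 'A')]
print(with_reverse("ABCA", 2))   # adds ('AC', 'B') and ('CB', 'A')
```

Since the kernel is defined on sets of pairs rather than on strings, any of these lists can be passed to it unchanged.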

Third, recent works in protein homology detection have led to algorithms as well as kernels
[16] |Sj that can handle minor substitutions between $k$-mers (see footnote 2) to detect resemblance beyond exact
matching of strings. A simple way to achieve such a tolerance within the framework defined in
this paper is to redefine $\mathcal{X}$ so that transitions stored in a set $X$ are given a weighting factor.
This means that $\mathcal{X}$ is redefined as $\mathcal{X} = \bigcup_{n=0}^{\infty} (E^D \times E \times [0,1])^n$, and similarly an element $X$ of
$\mathcal{X}$ is written as $X = \{(x^i, x'^i, \mu^i)\}_{i=1..N_X}$ where each $\mu^i$ is in $[0,1]$. Each encountered
(context, letter) pair can thus be translated into a subset of fixed size of similar (context, letter)
pairs given a lesser weight in the overall computation. Given a $d \times d$ matrix of letter-to-letter
substitution probabilities in amino-acid sequences, one can derive a finite
set of weighted neighbors from a simple context string. We write $p(e|f)$ for the probability of $f$ being
replaced by $e$ when $e \ne f$, both $e$ and $f$ in $E$. We also set $p(e|e) = 1$. Following our previous toy
example, we can then consider a transition $(AB, C, 1)$ to be split into a set of weighted mismatch
transitions $\{(AB, C, 1)\} \cup \{(ef, C, p(e|A)\,p(f|B)),\ e, f \in E\}$ which can be added in full to $X$ or
limited to the $S$ most likely substitutions. By redefining our notations to be now

$$\rho_m(X) = \sum_{i=1}^{N_X} \mu^i\, \mathbf{1}(x^i = m),$$

$$\hat{\theta}_{m,e}(X) = \begin{cases} \dfrac{\sum_{i=1}^{N_X} \mu^i\, \mathbf{1}(x^i = m,\ x'^i = e)}{\rho_m(X)} & \text{if } \rho_m(X) > 0, \\ \dfrac{1}{d} & \text{else,} \end{cases}$$

the complexity of our algorithm is still linear in $N_X + N_Y$ (but with a larger multiplicative
constant) and our kernel now takes mismatches into account.
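
A minimal sketch of the weighted mismatch expansion of a single transition is given below. The substitution table `sub_prob` holds illustrative values, not a real amino-acid substitution matrix, and the function name and the `top` argument (playing the role of the limit on the most likely substitutions) are hypothetical. Each neighboring context is listed once, with the exact match receiving weight 1.

```python
from itertools import product


def mismatch_neighbors(context, letter, sub_prob, alphabet, top=None):
    """Weighted (context, letter, weight) triples; sub_prob[f][e] is p(e|f) for e != f."""
    neighbors = []
    for candidate in product(alphabet, repeat=len(context)):
        weight = 1.0
        for e, f in zip(candidate, context):
            weight *= 1.0 if e == f else sub_prob[f][e]
        neighbors.append(("".join(candidate), letter, weight))
    neighbors.sort(key=lambda t: -t[2])     # most likely substitutions first
    return neighbors if top is None else neighbors[:top]


# Illustrative table: every substitution has probability 0.1 (assumed value).
sub_prob = {f: {e: 0.1 for e in "ABC" if e != f} for f in "ABC"}
for triple in mismatch_neighbors("AB", "C", sub_prob, "ABC", top=3):
    print(triple)
```

Enumerating all candidate contexts grows exponentially with the context length, so for a 20-letter alphabet one would in practice generate only the few most likely neighbors instead of sorting the full list.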



5.2 Source coding and compression interpretation 

There is a very classical duality between source distributions (a random model to generate infinite
sequences) and sequence compression. Roughly speaking, if a finite sequence $X$ has a probability
$P(X)$ of being generated by a source distribution $P$, then one can design a binary code to represent
$X$ by $r(X) = -\log_2 P(X)$ bits, up to 2 bits, using for example arithmetic coding.

2 A $k$-mer is a substring of length $k$ found in a sequence.

When sequences are generated by an unknown source $P$, it is classical to form a coding source
distribution by averaging several a priori sources. Under reasonable assumptions, one can design
universal codes this way, in the sense that the average length of the codes is almost as short as
if $P$ was known and the best source was used. As an example, the context-tree weighting (CTW)
algorithm [28] defines a coding probability $P_\pi$ for sequences by averaging source distributions defined
by context trees as follows:

$$P_\pi(X) := \sum_{\mathcal{D} \in \mathcal{F}_D} \pi(\mathcal{D}) \int_{\Theta_{\mathcal{D}}} P_{\mathcal{D},\theta}(X) \prod_{s \in \mathcal{D}} \omega_{1/2}(d\theta_s), \qquad (8)$$

where $\omega_{1/2}$ is the Dirichlet prior with parameter $(1/2, \ldots, 1/2)$. Up to the mixture of Dirichlet priors and
the exponents, we therefore see, by comparing (6) with (8), that our covariance kernel between
two sequences can roughly be interpreted as the probability under $P_\pi$ of the concatenation of the
two sequences. In terms of code length, $-\log K(X,Y)$ is roughly equal to $r_\pi(XY)$, the number of bits
required by the CTW algorithm to compress $X$ and $Y$ concatenated together.

Suppose now that the kernel is normalized as follows, to ensure $\tilde{K}(X, X) = 1$ for any sequence:

$$\tilde{K}(X,Y) = \frac{K(X,Y)}{\sqrt{K(X,X)\,K(Y,Y)}}.$$

We then obtain that $-\log \tilde{K}(X, Y)$ is roughly equal to:

$$r_\pi(XY) - \frac{r_\pi(XX) + r_\pi(YY)}{2}. \qquad (9)$$

This non-negative number can be interpreted as the difference between the number of bits required
to encode $XY$ and the average of the numbers of bits required to encode $XX$ and $YY$.

In spite of its caveats, this derivation gives a useful intuition about the operation performed
by the covariance kernel. It also suggests a general approach to derive a kernel for sequences from
a compression algorithm, by compressing $XX$, $YY$ and $XY$ successively, and comparing their
lengths with (9). Finding conditions on the compression algorithm that ensure that this procedure
leads to a valid kernel remains, however, an open problem that we are currently investigating.
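
The compression-based comparison can be tried with any off-the-shelf compressor. The sketch below uses Python's `zlib` purely as a stand-in for the CTW coder; nothing guarantees that the resulting quantity is non-negative, symmetric, or derived from a valid kernel, which is precisely the open question raised above. The function names are hypothetical.

```python
import zlib


def code_length(data: bytes) -> int:
    """Number of bits zlib needs for `data` (a stand-in for the CTW code length)."""
    return 8 * len(zlib.compress(data, 9))


def coding_gap(x: bytes, y: bytes) -> float:
    """Bits for XY minus the average of the bits for XX and YY."""
    return code_length(x + y) - 0.5 * (code_length(x + x) + code_length(y + y))


a = b"MKVLAAGIVGLLLAQPSFAMKVLAAGIVGLLLAQPSFA"
b = b"GSHMTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT"
print(coding_gap(a, a), coding_gap(a, b))
```

By construction the gap of a sequence with itself is exactly zero, mirroring the normalization that sets the kernel of a sequence with itself to 1.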



6 Experiments 

We report preliminary results concerning the performance of the covariance kernel on a benchmark 
experiment that tests the capacity of SVMs to detect remote homologies between protein domains. 
This is simulated by recognizing domains that are in the same SCOPJ2I (ver. 1.53) superfamily, 
but not in the same family, using the procedure described in We used the files compiled by 
the authors of |T^]. For each of the 54 families tested, we computed the ROC (Receiver Operating
Characteristic) score to measure the performance of an SVM based on the covariance kernel (the ROC
score is the normalized area under the curve which plots the number of true positives as a function
of false positives). We tested different parameters of our kernel, and compared its performance
with the best mismatch kernel presented in ^Hj), which performs at a state-of-the-art accuracy level
and can be implemented in linear time. The classification was performed using the publicly available Gist
2.0.5 implementation of SVM (see footnote 3).

3 http://microarray.cpmc.columbia.edu/gist/download.html
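
For reference, the ROC score of one family and the per-threshold family counts used to compare methods can be sketched as follows. This is a generic rank-based computation of the normalized area under the ROC curve, not the code used in the experiments; the function names are hypothetical, and counting families with a score at or above the threshold is an assumption.

```python
def roc_score(scores, labels):
    """Normalized area under the ROC curve: P(positive scored above negative)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def families_above(roc_scores, threshold):
    """Number of families whose ROC score reaches the threshold."""
    return sum(r >= threshold for r in roc_scores)


print(roc_score([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
print(families_above([0.9, 0.7, 0.95], 0.85))
```

The pairwise formulation is quadratic in the number of test examples, which is acceptable for a sketch; a sort-based version would be preferable on large test sets.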
Our covariance kernel has several parameters. The depth $D$, the width $\sigma$ and the branching
probability $\varepsilon$ are the simplest to adjust; the selection of a Dirichlet mixture is a more
difficult choice. Given the large number of parameters and the risk of overfitting the benchmark
dataset by carefully optimizing them, we only report preliminary results with two settings. First,
we used a single Dirichlet distribution with parameters $(1/2, \ldots, 1/2)$, with $D = 5$, $\sigma = 5$ and
$\varepsilon = 0.5$. Second, we used a basic 3-component Dirichlet mixture that models three classes of amino
acids (hydrophobic/hydrophilic/highly conserved). This mixture, called hydro-cons.3comp, was
downloaded from a Dirichlet mixture repository (see footnote 4). Other parameters were set to $D = 4$, $\sigma = 1$ and
$\varepsilon = 0.5$.

Figure 2 plots the total number of families for which a given method exceeds a ROC score
threshold. There is no significant difference between the three methods. The mismatch kernel seems
to perform better on families with a large ROC score, while the covariance kernels tend to outperform the
mismatch kernel for families with a ROC score below 0.85. This observation is encouraging as it suggests
that covariance kernels might be better adapted to difficult problems, corresponding to low sequence
similarity, than the mismatch kernel. It is also worth mentioning that we did not test the variant
suggested in Section 5.1 to take mismatches into account. The kernel is therefore based only on
the same features as the spectrum kernel ^3], which is known to perform worse than the mismatch
kernel tested.

Figure 2: Performance of three kernels on the problem of recognizing a domain's superfamily. The
curve shows the total number of families for which a given method exceeds a ROC score threshold.

7 Conclusion 

We introduced a novel class of kernels for sequences that are fast to compute and have the flexibility
to include prior knowledge through the definition of probabilistic models and prior distributions.
The kernel is a covariance kernel based on a family of context-tree models, and makes a link
between string kernels and the theory of universal source coding. On a benchmark experiment of
remote homology detection it performs at a state-of-the-art level. Further accuracy improvements
are expected from a more careful tuning of the parameters, on the one hand, and from the
implementation of sampling strategies to manage mismatches and reverse-ordered similarities, on the
other hand.

4 http://www.cse.ucsc.edu/research/compbio/dirichlets/